[DESCRIPTION]
[Invention Title] 

The Methods and Apparatus for Blind Separation of Multichannel Convolutive Mixtures 
in the Frequency-domain 

[Technical Field] 

This invention relates to signal processing, more particularly to a method, apparatus, and 
storage medium that contains a program for performing blind signal separation of multichannel 
convolutive mixtures in the frequency domain. 



[Background Art] 

In the art of speech processing, it is necessary to separate mixtures of multiple signals
(including speech signals) from multiple sensors in a multipath environment. Such a separation of
the mixtures without a priori knowledge of the signals is known as blind source separation (BSS). BSS
is very useful for separating signals that come from independent sources, such as multiple speakers and
sonar arrays. BSS techniques may be applied to speaker location tracking, speech recognition,
speech coding, 3-D object-based audio signal processing, acoustic echo cancellers, channel
equalization, estimation of direction of arrival, and detection of various biological signals such as
EEG and MEG.

Most BSS techniques try to recover the original signals by nullifying multipath effects.
Although filters of infinite length are required for this purpose in general, filters of
finite length also provide sufficient separation in most real-world environments.

There are two popular approaches to this BSS problem: (i) multiple decorrelation (MD)
methods that exploit the second-order statistics of signals as an independence measure and (ii)
multichannel blind deconvolution (MBD) methods that exploit the higher-order statistics.

The MD methods decorrelate mixed signals by diagonalizing second-order statistics. [See,
e.g., E. Weinstein, M. Feder, and A. V. Oppenheim, "Multi-channel signal separation by
decorrelation," IEEE Trans. Speech Audio Processing, vol. 1, no. 4, pp. 405-413, Apr. 1993; Lucas
Parra and Clay Spence, "Convolutive blind source separation of nonstationary sources," IEEE
Trans. Speech Audio Processing, pp. 320-327, May 2000; D.W.E. Schobben and P.C.W.
Sommen, "A frequency-domain blind signal separation method based on decorrelation," IEEE
Trans. Signal Processing, vol. 50, no. 8, pp. 1855-1865, Aug. 2002; N. Murata, S. Ikeda, and A.
Ziehe, "An approach to blind source separation based on temporal structure of speech signal,"
Neurocomputing, vol. 41, no. 4, pp. 1-24, 2001.] Diagonalization should be performed at multiple
time instants for successful separation of signals. For this reason, these methods can only be applied to
nonstationary signals. These methods are quite fast and stable. The MBD methods, on the other
hand, separate signals by minimizing the mutual information of the separated signals after they
have been transformed by a nonlinear function matched to the statistical distributions of the signals.
[See, e.g., S. Amari, S.C. Douglas, A. Cichocki, H.H. Yang, "Novel on-line adaptive learning
algorithm for blind deconvolution using the natural gradient approach," Proc. IEEE 11th IFAC
Symposium on System Identification, Japan, 1997, pp. 1057-1062; A. J. Bell and T. J. Sejnowski,
"An information maximization approach to blind separation and blind deconvolution," Neural
Computation, vol. 7, no. 6, pp. 1129-1159, Nov. 1995; L. Zhang, A. Cichocki, and S. Amari,
"Geometrical structures of FIR manifolds and their application to multichannel blind
deconvolution," Proc. of Int. IEEE Workshop on Neural Networks and Signal Processing, pp. 303-
312, Madison, Wisconsin, USA, Aug. 23-25, 1999.]
[Disclosure]
[Technical Problem] 

In the prior art, separation performance is significantly limited by
shortcomings such as frequency permutation, whitening, and the filter types employed.

The MD methods suffer from the frequency permutation problem: the separated sources
are ordered differently in each frequency bin, so that the resulting separated signals are still mixed.
Although there are some solutions to this permutation problem, separation performance degrades
as the length of the separating filters increases. The MBD methods, on the other hand, suffer from the
whitening effect: the spectra of the separated signals are whitened (or flattened). The linear
predictive method for speech signals has been proposed as a solution for this shortcoming of the
MBD methods. [See, e.g., S.C. Douglas, "Blind separation of acoustic signals," in Microphone
Arrays: Signal processing techniques and applications, M. Brandstein and D. Ward Eds., Springer,
pp. 355-380, 2001.] This method employs bidirectional filters, which may be inappropriate for normal
mixing environments in practice. In addition, parts of the room impulse response may be treated as
the vocal tract response of human speech signals.

Therefore, there is a need for a BSS technique that separates speech signals fast and 
accurately with high speech quality. 

[Technical Solution] 

This invention provides a method and apparatus for multichannel blind deconvolution that
estimates unidirectional separating filters for blind signal separation using the normalized natural
gradient in the block frequency domain.

Figure 1 depicts a system 100 for executing signal separation of the invention. The system
100 comprises an input device 126 that supplies the mixed signals that are to be separated and a
computer system 108 that executes the frequency-domain normalized multichannel blind
deconvolution routine 124 of the present invention. The input device 126 may contain any type of
device, but is illustratively shown to contain a sensor array 102, a signal processor 104 and a
recorded signal source 106. The sensor array 102 contains one or more transducers 102A, 102B,
102C such as microphones. The signal processor 104 digitizes a (convolutive) mixed signal.

The computer system 108 comprises a central processing unit (CPU) 114, a memory 122, an
input/output (I/O) interface 120, and support circuits 116. The computer system is generally
connected to the input device 126 and various input/output devices such as a monitor, a mouse, and
a keyboard through the I/O interface 120. The support circuits 116 comprise well-known circuits
such as power supplies, cache, timing circuits, a communication circuit, a bus and the like. The
memory 122 may include random access memory (RAM), read only memory (ROM), disk drive,
tape drive, flash memory, compact disk (CD), and the like, or some combination of memory
devices. The invention is implemented as the frequency-domain normalized multichannel blind
deconvolution routine 124 that is stored in memory 122 and executed by the CPU 114 to process
the signals from the input device 126. As such, the computer system 108 is a general purpose
computer system that becomes a specific purpose computer system when executing the routine 124
of the present invention. The invention can also be implemented in software, hardware or a
combination of software and hardware such as application specific integrated circuits (ASIC),
digital signal processors, and other hardware devices.

The illustrative computer system 108 further contains a speech recognition processor 118,
such as a speech recognition circuit card or speech recognition software, that is used to process
the separated signals that are extracted from the mixed signal by the invention. As such, mixed
signals in a room having more than two persons speaking simultaneously with background noise or
music can be captured by the microphone array 102. The speech signals captured by the
microphones 102 are mixed signals that should be separated into individual components for speech
recognition. The mixed signal is sent to the computer system 108 after being filtered, amplified, and
digitized by the signal processor 104. The CPU 114, executing the frequency-domain normalized
multichannel blind deconvolution routine 124, separates the mixed signals into their component
signals. From these component signals, background noise can be removed easily. The component
signals without noise are then applied to the speech recognition processor 118 to process the
component signals into computer text or computer commands. In this manner, the computer system
108, executing the frequency-domain normalized multichannel blind deconvolution routine 124,
performs signal preprocessing or conditioning for the speech recognition processor 118.

Figure 2a is a block diagram of the invention, a frequency-domain normalized
multichannel blind deconvolution 124. The frequency-domain normalized multichannel blind
deconvolution of the invention comprises a separation part 201, a nonlinear transformer 202, and a
filter updating part 203 that updates separating filter coefficients using the normalized natural
gradient. The separation part 201 separates a mixed multichannel signal $\mathbf{x}(k)$. The mixed signal
$\mathbf{x}(k)$ is observed in a multipath environment as the output of the $n$ sensors to the $m$ component
signals and is defined by the following equation:

$$\mathbf{x}(k) = [x_1(k), x_2(k), \ldots, x_n(k)]^T \tag{1}$$

where $x_j(k)$ is the mixed signal from the $j$-th sensor. The separating filter to separate $\mathbf{x}(k)$ into
its component signals is an $m \times n$ matrix $\mathbf{W}(z,k)$ whose $(i,j)$ component is represented by the
following equation:

$$w_{ij}(z,k) = \sum_{p=0}^{L-1} w_{ij,p}(k)\, z^{-p} \tag{2}$$

where $L$ is the length of the separating filters.

WO 2005/083706
PCT/KR2005/000526

The separated component signal $\mathbf{u}(k)$ is defined by the following equation:

$$\mathbf{u}(k) = [u_1(k), u_2(k), \ldots, u_m(k)]^T \tag{3}$$

where $u_i(k)$ is the $i$-th separated signal defined by the following equation:

$$u_i(k) = \sum_{j=1}^{n} \sum_{p=0}^{L-1} w_{ij,p}(k)\, x_j(k-p), \quad i = 1, \ldots, m \tag{4}$$

Figure 2b depicts a separating process for the case of $m = n = 2$. The separated signal $\mathbf{u}(k)$
from the separation part 201 is applied to the nonlinear transformer 202.
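
The separation of equation (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the routine 124 itself: the function name `separate`, the array layout `w[i, j, p]`, and the zero initial conditions are assumptions made for the example, and the filters are held fixed over time.

```python
import numpy as np

def separate(x, w):
    """Apply causal FIR separating filters, as in equation (4).

    x: (n, K) mixed signals, one row per sensor.
    w: (m, n, L) filter taps, w[i, j, p] multiplies x_j(k - p).
    Returns u with shape (m, K); samples before k = 0 are taken as zero.
    """
    m, n, L = w.shape
    K = x.shape[1]
    u = np.zeros((m, K))
    for i in range(m):
        for j in range(n):
            # Full convolution truncated to K samples keeps the filter causal.
            u[i] += np.convolve(x[j], w[i, j])[:K]
    return u

# Toy case m = n = 2: identity filters plus one cross tap at lag 1.
x = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
w = np.zeros((2, 2, 2))
w[0, 0, 0] = w[1, 1, 0] = 1.0
w[0, 1, 1] = 0.5  # u_1(k) = x_1(k) + 0.5 * x_2(k - 1)
u = separate(x, w)
```

Because every tap index p is non-negative, the filters are unidirectional and no sample delay has to be introduced.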

The nonlinear transformer 202 performs transformation of the separated signal through a 
memoryless nonlinear function so that the nonlinear-transformed signal has a uniform probability 
density. The nonlinear transformation is defined by the following equation: 
$$y_i(k) = f(u_i(k)), \quad i = 1, \ldots, m \tag{5}$$

Figure 2c is an illustration of the nonlinear transformation in which a signal with a Laplacian
probability density is mapped into a signal with a uniform probability density. The function to be used
in the nonlinear transformation is closely related to the probability density. For audio and speech
signals, $\mathrm{sgn}(u)$ or $\tanh(u)$ is used in general.
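
The mapping of Figure 2c can be illustrated numerically. The sketch below assumes a zero-mean, unit-scale Laplacian density and uses its cumulative distribution function as the memoryless mapping, which is one standard way to obtain a uniform output; the function names and the sample size are choices made for this example only. The function sgn(u) appears here as the score function of the same density.

```python
import numpy as np

def laplacian_cdf(u):
    """CDF of the zero-mean, unit-scale Laplacian density p(u) = 0.5 * exp(-|u|)."""
    return 0.5 + 0.5 * np.sign(u) * (1.0 - np.exp(-np.abs(u)))

def score(u):
    """Score function -d/du log p(u) of the same density, which equals sgn(u)."""
    return np.sign(u)

rng = np.random.default_rng(0)
u = rng.laplace(loc=0.0, scale=1.0, size=100_000)
y = laplacian_cdf(u)

# A flat histogram indicates that y is close to uniform on [0, 1].
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
fractions = counts / y.size
```

Each of the ten bins should hold roughly one tenth of the samples, which is what a uniform probability density predicts.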

The filter updating part 203 updates the separating filter coefficients using the steepest
ascent rule with natural gradient by the following equation:

$$w_{ij,p}(k+1) = w_{ij,p}(k) + \mu\, \Delta w_{ij,p}(k) \tag{6}$$

for $1 \le i \le m$, $1 \le j \le n$, $0 \le p \le L-1$, where $\mu$ is the step size and $\Delta w_{ij,p}(k)$ is the natural gradient
defined by the following equation:

$$\Delta w_{ij,p}(k) = w_{ij,p}(k) - \sum_{l=1}^{m} \sum_{q=0}^{p} \bar{y}_i(k)\, \bar{u}_l(k-p+q)\, w_{lj,q}(k) \tag{7}$$

where $\bar{y}_i(k)$ and $\bar{u}_l(k)$ are the frequency-domain normalized versions, having flat spectrum, of
$y_i(k)$ and $u_l(k)$, respectively. Note also that the filter lag $q$ in equation (7) is limited up to $p$, not
up to $L-1$. In this invention the separating filter is unidirectional of length $L$. Thus no sample delay
is required.
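
A time-domain sketch of the update in equations (6) and (7) is given below for a single time index. It is an illustration under stated assumptions: the spectral normalization of the signals is omitted, so plain `y_k` and `u_hist` stand in for the normalized quantities, and the function names and array layouts are hypothetical.

```python
import numpy as np

def natural_gradient(w, y_k, u_hist):
    """Natural gradient of equation (7) for one time index k.

    w:      (m, n, L) separating filter taps w[i, j, p].
    y_k:    (m,) nonlinear outputs y_i(k).
    u_hist: (m, L) recent separated samples, u_hist[l, d] = u_l(k - d).
    Assumption: y_k and u_hist are used without spectral normalization.
    """
    m, n, L = w.shape
    dw = np.empty_like(w)
    for p in range(L):
        acc = np.zeros((m, n))
        for q in range(p + 1):  # the lag q stops at p, not at L - 1
            # outer(y, u)[i, l] = y_i(k) * u_l(k - p + q)
            acc += np.outer(y_k, u_hist[:, p - q]) @ w[:, :, q]
        dw[:, :, p] = w[:, :, p] - acc
    return dw

def update(w, y_k, u_hist, mu):
    """Steepest-ascent step of equation (6)."""
    return w + mu * natural_gradient(w, y_k, u_hist)

# Toy case: m = n = 2, L = 2, identity filters at lag 0.
w = np.zeros((2, 2, 2))
w[0, 0, 0] = w[1, 1, 0] = 1.0
y_k = np.array([1.0, -1.0])
u_hist = np.array([[2.0, 0.0], [0.5, 0.0]])
dw = natural_gradient(w, y_k, u_hist)
w_next = update(w, y_k, u_hist, mu=0.0025)
```

Restricting q to the range 0 to p means that only present and past samples of the separated signals enter the gradient, which matches the unidirectional filter structure.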

In this invention, the above-mentioned process is performed in the frequency domain in an
overlap-save manner to take advantage of the FFT (Fast Fourier Transform). The filter length,
the block length, and the frame length are denoted as $L$, $M$, and $N$, respectively. The amount of overlap
between frames is determined by the ratio $r = N/M$. In the sequel, 50% overlap is assumed ($r = 2$) and
the FFT size is assumed to be equal to the frame length for simplicity.

Figure 3 depicts a flow chart of an embodiment of this invention, the frequency-domain
normalized multichannel blind deconvolution. With reference to the flow chart, the mixed signal
$\mathbf{x}(k)$ is input at step 301. At step 302, the mixed signal forms a current frame of two ($r = 2$)
consecutive blocks of $M$ samples as follows:

$$\mathbf{x}_j(b) = [x_j(bM - 2M + 1), \ldots, x_j(bM)]^T, \quad j = 1, \ldots, n \tag{8}$$
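
The framing of equation (8) can be sketched as follows. The helper name `frame`, the 0-based storage convention, and the zero-filling of samples that precede the start of the recording are assumptions made for this example.

```python
import numpy as np

def frame(x, b, M):
    """Return the frame of equation (8) for block index b (1-based), with r = 2.

    x: (n, K) mixed signals, where x[j, k - 1] holds the sample x_j(k).
    The frame holds samples bM - 2M + 1 ... bM, that is, the previous block
    followed by the current block. Samples before k = 1 are taken as zero
    (an assumption; the source does not say how the first frame is filled).
    """
    n, K = x.shape
    out = np.zeros((n, 2 * M))
    start, stop = (b - 2) * M, b * M   # 0-based half-open index range
    lo = max(start, 0)
    out[:, lo - start:] = x[:, lo:stop]
    return out

x = np.arange(1, 9, dtype=float)[None, :]  # one channel holding samples 1..8
M = 2
frames = [frame(x, b, M) for b in range(1, 5)]
```

Consecutive frames share M samples, which is the 50% overlap assumed in the text.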

where $b$ denotes the block index. At step 303, the mixed signal is separated using the separating
filters

$$\mathbf{w}_{ij}(b) = [w_{ij,0}, w_{ij,1}, \ldots, w_{ij,L-1}]^T \tag{9}$$

The separating filters are generally initialized as

$$\mathbf{w}_{ij}(0) = [1, 0, \ldots, 0]^T, \quad i = j \tag{10a}$$

$$\mathbf{w}_{ij}(0) = [0, \ldots, 0]^T, \quad i \ne j \tag{10b}$$

If any useful information on the separating filters is available, however, it can be used
to initialize the separating filters. The separated signal is computed in the frequency
domain using circular convolution as in the following equation:

$$\mathbf{u}_i(f,b) = \sum_{j=1}^{n} \mathbf{w}_{ij}(f,b) \odot \mathbf{x}_j(f,b) \tag{11}$$

where $\odot$ denotes the component-wise multiplication, and $f$ denotes a frequency-domain
quantity such that

$$\mathbf{w}_{ij}(f,b) = \mathbf{F}\,\mathbf{w}_{ij}(b) \tag{12a}$$

$$\mathbf{x}_j(f,b) = \mathbf{F}\,\mathbf{x}_j(b) \tag{12b}$$

where $\mathbf{F}$ is the $N \times N$ DFT matrix. The separated signal is then transformed back into the time
domain in order to discard the first $L$ aliased samples as in the following equation:

$$\mathbf{u}_i(b) = \mathbf{P}_{0,N-L}\,\mathbf{F}^{-1}\,\mathbf{u}_i(f,b) \tag{13}$$

where $\mathbf{P}_{0,N-L}$ is the projection matrix (or window matrix) that sets the first $L$ samples to zero and is
defined as follows:

$$\mathbf{P}_{0,N-L} = \begin{bmatrix} \mathbf{0}_L & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-L} \end{bmatrix} \tag{14}$$

where $\mathbf{0}_L$ is the $L \times L$ zero matrix and $\mathbf{I}_{N-L}$ is the $(N-L) \times (N-L)$ identity matrix.
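
Equations (11) to (14) amount to one overlap-save filtering step per frame. The sketch below assumes that the L filter taps are zero-padded to the frame length N before the DFT, which the N x N DFT matrix implies; the function name and array layouts are hypothetical. The demonstration checks that the retained samples agree with an ordinary linear convolution.

```python
import numpy as np

def separate_frame(x_frame, w, L):
    """Equations (11)-(13) for one frame.

    x_frame: (n, N) current frame of each sensor signal.
    w:       (m, n, L) time-domain separating filters.
    Returns u of shape (m, N) with the first L samples set to zero.
    """
    N = x_frame.shape[1]
    W_f = np.fft.fft(w, n=N, axis=-1)        # zero-pads the taps to N, eq. (12a)
    X_f = np.fft.fft(x_frame, axis=-1)       # eq. (12b)
    U_f = np.einsum('ijf,jf->if', W_f, X_f)  # per-bin products summed over j, eq. (11)
    u = np.fft.ifft(U_f, axis=-1).real       # inputs are real, so the result is real
    u[:, :L] = 0.0                           # projection P_{0,N-L}, eq. (13)
    return u

rng = np.random.default_rng(0)
m, n, L, N = 2, 2, 4, 16
x_frame = rng.standard_normal((n, N))
w = rng.standard_normal((m, n, L))
u = separate_frame(x_frame, w, L)

# Reference: linear convolution restricted to the same frame.
ref = np.zeros((m, N))
for i in range(m):
    for j in range(n):
        ref[i] += np.convolve(x_frame[j], w[i, j])[:N]
```

Circular convolution only corrupts the first L - 1 samples of the frame, so discarding the first L samples leaves a result that is free of wrap-around.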

At step 304, the separated signal is transformed via a nonlinear function in the time
domain. One of the two following equations can be used:

$$\mathbf{y}_i(b) = f(\mathbf{u}_i(b)) = [\underbrace{0, \ldots, 0}_{L}, f(u_i(bM-2M+L+1)), \ldots, f(u_i(bM))]^T \tag{15a}$$

$$\mathbf{y}_i(b) = f(\mathbf{u}_i(b)) = [\underbrace{0, \ldots, 0}_{M}, f(u_i(bM-M+1)), \ldots, f(u_i(bM))]^T \tag{15b}$$

The output of this nonlinear function is used to compute the cross-correlations
$f(u_i(k))\,u_j(k-p)$, $p = 0, 1, \ldots, L-1$, at step 306. If equation (15a) is used, the cross-correlations will
be biased. If equation (15b) is used, the cross-correlations will be unbiased.
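At step 305, the alias-free normalized cross-power spectra are computed. Step 305 is
critical in this invention. The normalized cross-power spectrum is defined by the following
equation:

$$\mathbf{P}(b) = \begin{bmatrix} \tilde{\mathbf{P}}_{y_1u_1}(f,b) & \cdots & \tilde{\mathbf{P}}_{y_1u_m}(f,b) \\ \vdots & \ddots & \vdots \\ \tilde{\mathbf{P}}_{y_mu_1}(f,b) & \cdots & \tilde{\mathbf{P}}_{y_mu_m}(f,b) \end{bmatrix} \tag{16}$$

where $\tilde{\mathbf{P}}_{y_iu_j}(f,b)$ is the normalized cross-power spectrum between $\mathbf{y}_i(f,b)$ and $\mathbf{u}_j(f,b)$, to be
described below. If $i = j$, the expected value is normalized to 1 by the Bussgang property. At step 306,
the cross-power spectra are computed in the frequency domain by the following equation:

$$\mathbf{P}_{y_iu_j}(f,b) = \mathbf{y}_i(f,b) \odot \mathbf{u}_j^{*}(f,b) \tag{17}$$

where $*$ denotes the complex conjugation and

$$\mathbf{y}_i(f,b) = \mathbf{F}\,\mathbf{y}_i(b) \tag{18a}$$

$$\mathbf{u}_j(f,b) = \mathbf{F}\,\mathbf{u}_j(b) \tag{18b}$$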

Note that the cross-power spectra in equation (17) are computed using only the samples from the
current frame as in equations (18a) and (18b). At step 307, the power spectra of the separated signals
and the nonlinear-transformed signals are computed to normalize the cross-power spectra. In order
to accommodate the time-varying nature of the signal, the power spectra are updated at each block as
follows:

$$\mathbf{P}_{y_i}(f,b) = (1-\gamma)\,\mathbf{P}_{y_i}(f,b-1) + \gamma\,|\mathbf{y}_i(f,b)|^2, \quad i = 1, \ldots, m \tag{19a}$$

$$\mathbf{P}_{u_j}(f,b) = (1-\gamma)\,\mathbf{P}_{u_j}(f,b-1) + \gamma\,|\mathbf{u}_j(f,b)|^2, \quad j = 1, \ldots, m \tag{19b}$$

Here, $\gamma$ is a constant between 0 and 1. The power spectra are initialized as
$\mathbf{P}_{y_i}(f,0) = \mathbf{P}_{u_i}(f,0) = c\,[1, \ldots, 1]^T$, $i = 1, \ldots, m$, where $c$ is a small positive constant, $0 < c \ll 1$. At step
308, the cross-power spectra are normalized as follows:

$$\bar{\mathbf{P}}_{y_iu_j}(f,b) = \frac{\mathbf{P}_{y_iu_j}(f,b)}{\sqrt{\mathbf{P}_{y_i}(f,b) \odot \mathbf{P}_{u_j}(f,b)}} \tag{20}$$

where the division is performed component-wise. If the cross-power spectra in equation (20)
are transformed back into the time domain, however, the resulting cross-correlations contain aliased
parts. Furthermore, only the first $L$ cross-correlations are required to compute the natural gradient
in equation (7). Therefore, only the first $L$ cross-correlations must be extracted. This is performed
at step 309 by applying the proper constraint in the time domain as follows:

$$\tilde{\mathbf{P}}_{y_iu_j}(f,b) = \mathbf{F}\,\mathbf{P}_{L,0}\,\mathbf{F}^{-1}\,\bar{\mathbf{P}}_{y_iu_j}(f,b) \tag{21}$$

where $\mathbf{F}^{-1}$ is the $N \times N$ inverse DFT matrix and $\mathbf{P}_{L,0}$ is the $N \times N$ projection matrix, which
preserves the first $L$ samples and sets the remaining $(N-L)$ samples to zero, defined as

$$\mathbf{P}_{L,0} = \begin{bmatrix} \mathbf{I}_L & \mathbf{0} \\ \mathbf{0} & \mathbf{0}_{N-L} \end{bmatrix} \tag{22}$$
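
Steps 306 to 309 can be sketched together for one frame. The sketch rests on several assumptions that are not confirmed by the source: the normalization divides each cross-power spectrum by the square root of the product of the two running power spectra, tanh is used as the nonlinearity, and the function name, array layouts and parameter values are chosen for illustration only.

```python
import numpy as np

def normalized_cross_spectra(y, u, P_y, P_u, gamma, L):
    """Steps 306-309 for one frame.

    y, u:     (m, N) time-domain frames of nonlinear outputs and separated signals.
    P_y, P_u: (m, N) running power spectra from the previous block.
    Returns (P_tilde, P_y, P_u); P_tilde[i, j] is the alias-free normalized
    cross-power spectrum between y_i and u_j.
    """
    Y = np.fft.fft(y, axis=-1)                         # transform of y_i(b)
    U = np.fft.fft(u, axis=-1)                         # transform of u_j(b)
    P_yu = Y[:, None, :] * np.conj(U)[None, :, :]      # cross-power spectra
    P_y = (1 - gamma) * P_y + gamma * np.abs(Y) ** 2   # recursive power update
    P_u = (1 - gamma) * P_u + gamma * np.abs(U) ** 2
    # Assumed normalization: divide by sqrt(P_y_i * P_u_j) in every bin.
    P_bar = P_yu / np.sqrt(P_y[:, None, :] * P_u[None, :, :])
    r = np.fft.ifft(P_bar, axis=-1)                    # cross-correlations over lag p
    r[..., L:] = 0.0                                   # keep only the first L lags
    return np.fft.fft(r, axis=-1), P_y, P_u

rng = np.random.default_rng(0)
m, N, L, gamma, c = 2, 16, 4, 0.1, 1e-3
u = rng.standard_normal((m, N))
y = np.tanh(u)
P_y0 = np.full((m, N), c)   # initial power spectra, a small constant per bin
P_u0 = np.full((m, N), c)
P_tilde, P_y, P_u = normalized_cross_spectra(y, u, P_y0, P_u0, gamma, L)
lags = np.fft.ifft(P_tilde, axis=-1)
```

The inverse transform of the product of one spectrum with the conjugate of another is the circular cross-correlation over non-negative lags, so zeroing all lags from L onward is the time-domain constraint described above.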



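At step 310, the natural gradient is computed using the nonholonomic constraints as
follows:

$$\hat{\mathbf{P}}_{y_iu_j}(f,b) = \begin{cases} \mathbf{1} - \tilde{\mathbf{P}}_{y_iu_j}(f,b), & \text{for } i = j \\ -\tilde{\mathbf{P}}_{y_iu_j}(f,b), & \text{for } i \ne j \end{cases} \tag{23a}$$

$$\Delta\mathbf{w}_{ij}(f,b) = \sum_{l=1}^{m} \hat{\mathbf{P}}_{y_iu_l}(f,b) \odot \mathbf{w}_{lj}(f,b) \tag{23b}$$

where $\mathbf{1} = [1, \ldots, 1]^T$ and $\tilde{\mathbf{P}}_{y_iu_j}(f,b)$ is the alias-free normalized cross-power spectrum. The
nonholonomicity implies that separation does not respond to signal
powers but only to statistical dependence between signals.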

Note that $\hat{\mathbf{P}}_{y_iu_j}(f,b)$ in equation (23a) is approximately nonholonomic since the diagonal
components of the alias-free normalized cross-power spectra, $\tilde{\mathbf{P}}_{y_iu_i}(f,b)$, are 1 on the average. However,
exact nonholonomicity can be attained by forcing the diagonal components to zero as:

$$\hat{\mathbf{P}}_{y_iu_i}(f,b) = \mathbf{0} \tag{24}$$

Although all the components of the separating filters are learned in general, all diagonal
components can be omitted in learning so that the diagonal components are absorbed into the
off-diagonal components. This is easily achieved in this invention by setting the diagonal components
of the gradient to zero as follows:

$$\Delta\mathbf{w}_{ii}(f,b) = \mathbf{0} \tag{25}$$

If equations (24) and (25) are combined, the computation can be reduced. Note that, for the
special case of $m = n = 2$, the time-domain constraints in equation (21) are not necessary and the
computational burden is significantly reduced. Such flexibility for modifications is one advantage
of the present invention.
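
The gradient computation of step 310, with the two optional simplifications of equations (24) and (25), can be sketched as follows. The function name, the keyword flags, and the array layouts are hypothetical and serve only to illustrate the per-bin matrix product.

```python
import numpy as np

def natural_gradient_freq(P_tilde, W_f, exact=False, skip_diagonal=False):
    """Frequency-domain natural gradient with nonholonomic constraints.

    P_tilde: (m, m, N) alias-free normalized cross-power spectra.
    W_f:     (m, n, N) separating filters in the frequency domain.
    exact:         force the diagonal constrained spectra to zero, eq. (24).
    skip_diagonal: set the gradient of the diagonal filters to zero, eq. (25).
    """
    m = P_tilde.shape[0]
    eye = np.eye(m)[:, :, None]
    P_hat = eye - P_tilde             # 1 - P on the diagonal, -P elsewhere, eq. (23a)
    if exact:
        P_hat = P_hat * (1.0 - eye)   # eq. (24)
    dW = np.einsum('ilf,ljf->ijf', P_hat, W_f)   # sum over l in every bin, eq. (23b)
    if skip_diagonal:
        idx = np.arange(min(W_f.shape[0], W_f.shape[1]))
        dW[idx, idx, :] = 0.0         # eq. (25)
    return dW

m, n, N = 2, 2, 8
rng = np.random.default_rng(0)
W_f = rng.standard_normal((m, n, N)) + 1j * rng.standard_normal((m, n, N))

# Ideal case: unit diagonal and no cross terms, so nothing is left to learn.
P_ideal = np.tile(np.eye(m)[:, :, None], (1, 1, N)).astype(complex)
dW_ideal = natural_gradient_freq(P_ideal, W_f)

# Generic case with the diagonal filters excluded from learning.
P_rand = rng.standard_normal((m, m, N)) + 1j * rng.standard_normal((m, m, N))
dW_skip = natural_gradient_freq(P_rand, W_f, exact=True, skip_diagonal=True)
```

When the cross terms vanish and the diagonal terms equal one, the gradient is zero regardless of the signal powers, which is the behavior the nonholonomic constraint is meant to produce.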

At step 311, the separating filters are updated as: 

$$\mathbf{w}_{ij}(f,b+1) = \mathbf{w}_{ij}(f,b) + \mu\,\Delta\mathbf{w}_{ij}(f,b) \tag{26}$$

At step 312, the separating filters are normalized in the frequency domain to have unit norm. The 
separating filters with unit norm preserve signal power during iteration. 
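
Steps 311 and 312 can be sketched as one update followed by a rescaling. The source states only that the filters are normalized to unit norm; the particular choice below, which scales the n filters of each output channel jointly so that their time-domain taps have unit Euclidean norm, is an assumption, as are the function name and the random stand-in for the gradient.

```python
import numpy as np

def update_and_normalize(W_f, dW, mu):
    """Filter update followed by a unit-norm scaling.

    W_f, dW: (m, n, N) frequency-domain filters and natural gradient.
    Assumption: for each output channel i, the n filters are scaled together
    so that the sum of their squared time-domain taps equals 1. By Parseval's
    relation this equals the sum of squared bin magnitudes divided by N.
    """
    W_new = W_f + mu * dW
    N = W_new.shape[-1]
    norm = np.sqrt(np.sum(np.abs(W_new) ** 2, axis=(1, 2), keepdims=True) / N)
    return W_new / norm

m, n, N = 2, 2, 8
W_f = np.zeros((m, n, N), dtype=complex)
W_f[0, 0] = W_f[1, 1] = 1.0   # spectrum of a unit impulse at lag 0
rng = np.random.default_rng(0)
dW = rng.standard_normal((m, n, N)) + 1j * rng.standard_normal((m, n, N))  # stand-in gradient
W_next = update_and_normalize(W_f, dW, mu=0.0025)

taps = np.fft.ifft(W_next, axis=-1)
energy = np.sum(np.abs(taps) ** 2, axis=(1, 2))
```

Scaling each output channel as a whole changes only the gain of that separated signal, not the relative weighting of the sensors that feed it.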
At step 313, the termination conditions are checked to decide whether the separating procedure
should be terminated.

At step 314, the converged separating filters are used to filter the mixed signals to obtain the
separated signals. Equation (11) in step 303 can also be used in this step.

Although various embodiments which incorporate the teaching of the present invention
have been shown and described in detail herein, those skilled in the art can readily devise many
other varied embodiments that still incorporate these teachings. Accordingly, all
such alternatives, modifications, permutations, and variations to the exemplary embodiments can
be made without departing from the scope and spirit of the present invention.

[Advantageous Effects]

Figure 4a shows an example of separating mixed signals recorded in a real-world
environment. Speech and music signals are recorded in a room using two microphones and the
mixed signals are then separated using the inventive method. Figure 4a shows two mixed signals
$\mathbf{x} = (x_1, x_2)$ and two separated signals $\mathbf{u} = (u_1, u_2)$ from top to bottom. The parameters used are $L = 128$,
$M = 2L$, $N = 2M$, and $\mu = 0.0025$. Figure 4b shows the final separating filters in this example.

This invention can separate a desired signal from the mixtures with high speech quality so 
that the separated signal can be directed to a speech recognizer or a speech coder. Figure 5 shows 
the original signal s , the mixed signal x , and the separated signal u from top to bottom for each 
channel. Figure 5 demonstrates high quality of the separated speech signals. 


[Description of Drawings] 

The teaching of the present invention can be readily understood by considering the 
following description in conjunction with accompanying drawings, in which: 

Fig. 1 depicts a system for executing a software implementation of the present invention;

Fig. 2a depicts a block diagram of a multichannel blind deconvolution using normalized
natural gradient;

Fig. 2b depicts a diagram of separating filters to separate mixed multichannel signals;

Fig. 2c depicts a schematic graph of transforming a separated signal into a signal with
uniform probability density using a nonlinear function;

Fig. 3 depicts a flow chart of an embodiment of the present invention;

Fig. 4a depicts separated signals, speech and music, from mixtures recorded in a real room
by the present inventive method;

Fig. 4b depicts the final converged separating filters $w_{ij}$ to separate the mixtures
recorded in a real room by the present inventive method; and

Fig. 5 depicts an original speech signal $\mathbf{s}$, a mixed speech signal $\mathbf{x}$, and a separated
signal $\mathbf{u}$ for each channel.

[Industrial Applicability] 

The present invention finds application in a speech recognition system as a signal
preprocessor system for deconvolving and separating signals from different sources such that a
speech recognition processor can respond to various speech signals without interfering noise
sources.